SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol

Authors

  • Abhinandan Das
  • Indranil Gupta
  • Ashish Motivala
Abstract

Several distributed peer-to-peer applications require weakly-consistent knowledge of process group membership information at all participating processes. SWIM is a generic software module that offers this service for large-scale process groups. The SWIM effort is motivated by the unscalability of traditional heartbeating protocols, which either impose network loads that grow quadratically with group size, or compromise response times or false positive frequency with respect to detecting process crashes. This paper reports on the design, implementation and performance of the SWIM sub-system on a large cluster of commodity PCs. Unlike traditional heartbeating protocols, SWIM separates the failure detection and membership update dissemination functionalities of the membership protocol. Processes are monitored through an efficient peer-to-peer periodic randomized probing protocol. Both the expected time to first detection of each process failure, and the expected message load per member, do not vary with group size. Information about membership changes, such as process joins, drop-outs and failures, is propagated via piggybacking on ping messages and acknowledgments. This results in a robust and fast infection style (also epidemic or gossip-style) of dissemination. The rate of false failure detections in the SWIM system is reduced by modifying the protocol to allow group members to suspect a process before declaring it as failed; this allows the system to discover and rectify false failure detections. Finally, the protocol guarantees a deterministic time bound to detect failures. Experimental results from the SWIM prototype are presented. We discuss the extensibility of the design to a WAN-wide scale.

Author last names are in alphabetical order. The authors were supported in part by NSF CISE grant 9703470, in part by DARPA/AFRL-IFGA grant F30602-99-1-0532, and in part by a grant under NASA’s REE program, administered by JPL.
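The probing and suspicion mechanism the abstract describes can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `Member`, `Status`, `probe_round`, `is_reachable` and the `suspect_rounds` timeout are all hypothetical, chosen only for illustration. Each protocol period, a member pings one randomly chosen peer; an unanswered ping marks the peer as suspected rather than failed, and only a suspicion that survives `suspect_rounds` periods becomes a failure. Piggybacked dissemination of membership updates is omitted.

```python
import random
from enum import Enum


class Status(Enum):
    ALIVE = "alive"
    SUSPECT = "suspect"
    FAILED = "failed"


class Member:
    """One process's local membership view (illustrative names, not from the paper)."""

    def __init__(self, name, peers, suspect_rounds=2, seed=None):
        self.name = name
        # Local view of every other member, initially assumed alive.
        self.view = {p: Status.ALIVE for p in peers if p != name}
        # Protocol periods elapsed since each peer became suspected.
        self.suspect_age = {}
        # Assumed timeout: periods a suspicion may last before it becomes a failure.
        self.suspect_rounds = suspect_rounds
        self.rng = random.Random(seed)

    def _age_suspects(self):
        # Promote suspicions that have outlived the timeout to failures.
        for peer in list(self.suspect_age):
            self.suspect_age[peer] += 1
            if self.suspect_age[peer] >= self.suspect_rounds:
                self.view[peer] = Status.FAILED
                del self.suspect_age[peer]

    def probe_round(self, is_reachable):
        """Run one protocol period; is_reachable(peer) stands in for ping/ack."""
        self._age_suspects()
        candidates = sorted(p for p, s in self.view.items() if s is not Status.FAILED)
        if not candidates:
            return None
        target = self.rng.choice(candidates)
        if is_reachable(target):
            # An acknowledgment clears any outstanding suspicion.
            self.view[target] = Status.ALIVE
            self.suspect_age.pop(target, None)
        elif self.view[target] is Status.ALIVE:
            # A missed ack only raises a suspicion, not a failure.
            self.view[target] = Status.SUSPECT
            self.suspect_age[target] = 0
        return target
```

Calling `probe_round(lambda peer: False)` repeatedly on a two-member group moves the unreachable peer from alive to suspected and then to failed, while a single successful probe in between returns it to alive. This is the window in which a false detection can be rectified.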

Similar resources

A Case for Epidemic Fault Detection and Group Membership in HPC Storage Systems

Fault response strategies are crucial to maintaining performance and availability in HPC storage systems, and the first responsibility of a successful fault response strategy is to detect failures and maintain an accurate view of group membership. This is a nontrivial problem given the unreliable nature of communication networks and other system components. As with many engineering problems, tr...

Enforcing Routing Consistency in Structured Peer-to-Peer Overlays: Should We and Could We?

In this paper, we argue that enforcing routing consistency in key-based routing (KBR) protocols can simplify P2P application design and make structured P2P overlays suitable for more applications. We define two levels of routing consistency semantics, namely weakly consistent KBR and strongly consistent KBR. We focus on an algorithm that provides strong consistency based on group membership serv...

Scalable, Fault-tolerant Membership for Group Communication on HPC Systems

VARMA, JYOTHISH S. Scalable, Fault-Tolerant Membership for Group Communication on HPC Systems. (Under the direction of Associate Professor Dr. Frank Mueller). Reliability is increasingly becoming a challenge for high-performance computing (HPC) systems with thousands of nodes, such as IBM’s Blue Gene/L. A shorter mean-time-to-failure can be addressed by adding fault tolerance to reconfigure work...

Scalable secure one-to-many group communication using dual encryption

Multicasting is a scalable solution for group communication. Whereas secure unicast is a well-understood problem, scalable secure multicast poses several unique security problems, namely group membership control and scalable key distribution to a dynamic group. We address scalability in the proposed protocol by using hierarchical subgrouping. Third party hosts or members of the mul...

Group membership in the epidemic style

Existing group membership mechanisms provide consistent views of membership changes. However, they require heavyweight synchronous multicast protocols. We present a new lightweight group membership mechanism that allows temporary inconsistencies in membership views. This mechanism uses epidemic communication techniques to ensure that all group members eventually converge to a consistent view of...

Publication date: 2002